ABSTRACT

In processing large volumes of speech and language data, we
are often interested in the distribution of languages, speakers, topics,
etc. For large data sets, these distributions are typically estimated at
a given point in time using pattern classification technology. Such
estimates can be highly biased, especially for rare classes. While
these biases have been addressed in some applications, they have
thus far been ignored in the speech and language literature. This
neglect causes significant error for low-frequency classes. Correcting
this biased distribution involves exploiting uncertain knowledge of
the classifier error patterns. The Metropolis-Hastings algorithm
allows us to construct a Bayes estimator for the true class proportions.
We experimentally evaluate this algorithm on a speaker recognition
task. In this experiment, the Bayes estimator reduces maximum
RMSE by a factor of five. Performance is also more consistent,
with the range of RMSE reduced by a factor of four.

Index  Terms —  knowledge  acquisition,  Monte  Carlo  methods, 
speech  processing 

1.  INTRODUCTION 

There is increasing interest in characterizing links in a communication
network, not simply in terms of message count but by content.
For example, what proportion of internet traffic is peer-to-peer?
There may be little or no prior knowledge. For communication between
humans, characterization can involve any of the standard tasks
in language processing. We might assign a categorical label (e.g.
language, speaker or topic) to linguistic content encoded in audio,
text or document images, then focus on the distribution over these
categories. Histograms of these distributions provide useful summary
statistics to help humans cope with information overload [1].

Automated classifiers have many uses, but their output is typically
biased due to classification errors. Proportional bias increases
as the frequency of a class decreases. For example, consider some
binary task with 5% false alarm rate and negligible missed detections.
If 20% of the data is truly from the target class, around 24%
of the data will be hypothesized as such by the classifier due to false
alarms. This is incorrect, but perhaps still useful. However, for a true
value of 0.01%, the expected 5% hypothesized proportion is wrong
by orders of magnitude. This large proportional bias is unsatisfactory,
especially in applications where rare events are of interest.
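The arithmetic in this example can be checked with a minimal sketch. The function name and its parameters are illustrative assumptions, not part of the original work; the rates are the ones quoted in the paragraph above.

```python
def hypothesized_proportion(true_prop, false_alarm, miss=0.0):
    """Expected fraction of data labeled 'target' by a binary classifier."""
    detected = true_prop * (1.0 - miss)            # targets correctly labeled
    false_alarms = (1.0 - true_prop) * false_alarm  # non-targets labeled target
    return detected + false_alarms


# True proportions of 20% and 0.01%, with a 5% false alarm rate.
for true_prop in (0.20, 0.0001):
    hyp = hypothesized_proportion(true_prop, false_alarm=0.05)
    print(f"true={true_prop:.4%} hypothesized={hyp:.4%} ratio={hyp / true_prop:.1f}")
```

For the 20% case the hypothesized proportion is 24%, while for the 0.01% case it is roughly 500 times the true value.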

Given the classifier error rates, it is straightforward to estimate
the most likely class proportions via the E-M algorithm. The error
rates can be estimated from some sample set with manual annotation.
However, estimates based upon finite data have some degree of
uncertainty. Optimal decisions can require understanding of variance:
the most likely target class proportion may be 20%, but how plausible
is 19%, or 10%? This is a well understood problem in statistics.
given the assumption that the test data is drawn from the same population
as the training data [2]. To provide variance information rather
than a simple point estimate requires a different technical approach.

A  hierarchical  Bayes  model  for  the  true  class  proportions  can 
incorporate  error  rate  uncertainty.  The  Metropolis-Hastings  (M-H) 
algorithm  [3]  allows  us  to  construct  the  posterior  distribution  of  true 
class  proportions.  The  posterior  mean  provides  a  Bayes  estimate  of 
the  class  proportions,  while  posterior  variance  provides  confidence 
bounds  on  the  estimated  proportion. 

2.  RELATED  RESEARCH 

Issues of data summarization when using a classifier have not been
a traditional focus of Human Language Technology (HLT) research.
An appropriate model for classifier errors is presented in [2]; this
work however does not address the issue of estimation based upon
uncertain error rates. The bias inherent in hard classifier output has
been ignored by the speech and language processing community
(thus works such as [4] analyze output label proportions rather than
true class proportions). The work [5] also seeks a methodology that is
valid for all possible class proportions; it further provides an HLT
engineer's sketch of Bayesian decision theory. Their interest, however,
is in calibration of score likelihoods conditional on class, analogous
to our confusion matrix, rather than updating the hypothesis
prior distribution. Natural language processing uses complex classifiers
and machine learning techniques, but corpus summary statistics
have not been a primary concern.

Research areas involving high-speed high-volume data streams
(such as internet traffic) focus more on issues of speed and scalability.
Recently there has been convergence with HLT. Content mining
techniques are increasingly used to monitor networks [6], while there
is ongoing research on fast language processing scalable to massive
data streams. [7] describes one application in (text-based) language
and topic identification. As high-volume data processing incorporates
imperfect classifiers, classifier bias can seriously impact data
analysis.

The medical literature recognizes the issue of classification bias;
some authors use confusion matrix inversion, assuming known error
rates [8]. A few works note that this is unrealistic [9]. In particular,
the technical approach of [10] is very similar to ours. Their paper
considers only two classes and relies on a Gibbs sampling scheme
dependent on conjugate priors, but is readily extensible to more general
classification problems. These results seem to be unknown outside
of the epidemiology literature.

Our technical problem requires deducing true class proportions
from the classifier's hypothesized proportions and estimated error
patterns. From this perspective our solution simply adapts standard
Bayesian techniques to a particular mixture problem. The justification
for using Markov Chain Monte Carlo (MCMC) numerical
estimation is well-understood [3], but the practice involves some
art [11] [12].
3.  ESTIMATING  CLASS  PROPORTIONS


3.1.  Introduction 

We measure estimator performance via mean squared error (MSE).
In this section, we show that hypothesized class proportions act as
a shrinkage estimator towards the fixed eigenvector of the classifier
error rate matrix. This introduces uncontrolled bias and lack of
predictability into the MSE.

In incorporating a model of error rates, the Bayes estimator described
in this paper gains some desirable statistical properties. It
is consistent in the sense that given unlimited data it must converge
to the truth. It is admissible (no strictly lower risk estimator exists)
since it is Bayesian for a particular prior [13]. By construction it
has minimum expected squared error loss under explicit prior beliefs
about the parameters.

3.2.  The  Distribution  of  Classifier  Hypotheses 

Denote by $x$ the value of the true class label for some observation,
and by $y$ the hypothesized class label from the classifier output.
Assume multinomial samples $x$ and $y$ with associated class probability
vectors $V$ and $W$ respectively, where $V = \{v_i = P(x = i)\}$ and
$W = \{w_i = P(y = i)\}$. Improved classifier performance brings
$W$ closer to the true $V$, but accurate estimation of $V$ is possible for
imperfect classifiers given accurate knowledge of classifier error rates.

For a given data set and classifier we have a model with
class-conditional error probabilities

$C_{ij} = P(y = i \mid x = j)$

The $C_{ij}$ are independent of the (unknown) true distribution of $x$.
Given probability vectors $V$ for the true-class distribution and $W$
for the hypothesized-class distribution, this leads to the multinomial
parameter equation

$W = CV$  (1)
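When the error rates are treated as known exactly, equation (1) can be solved for the true proportions by matrix inversion. The following is a minimal two-class sketch of that step only; the helper names and the 5% error rates are illustrative assumptions, and the sketch ignores the error-rate uncertainty that the rest of this paper addresses.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def invert_2x2(M):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("error-rate matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Column j of C holds P(y = i | x = j); each column sums to one.
C = [[0.95, 0.05],
     [0.05, 0.95]]
V_true = [0.20, 0.80]

W = matvec(C, V_true)                  # hypothesized proportions, W = C V
V_recovered = matvec(invert_2x2(C), W)  # V = C^{-1} W
print(W, V_recovered)
```

With counts rather than exact probabilities, the inverted estimate can fall outside [0, 1], which is one practical reason to prefer a model-based estimate.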

The matrix $C$ has an eigenvalue 1, thus at least one 'fixed' eigenvector
$V_f$ such that $C V_f = V_f$. This $V_f$ is unique so long as
the Markov process defined by transitions $C$ is ergodic (irreducible
with recurrent aperiodic states). A sufficient (though not necessary)
condition is that no entries of $C$ are zero. In such a case, $V_f$ is the
unique attractor for all probability vectors $V$ under the action of $C$:
$\lim_{n \to \infty} C^n V = V_f$. This creates the bias in hypothesized
versus true class proportions: other vectors $V$ are drawn towards $V_f$.
Thus, $W = CV$ differs from $V$ except at $V_f$.
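The attraction towards the fixed eigenvector can be illustrated numerically. This is a sketch under assumed values: a two-class matrix with a 2% false alarm rate and a 9% missed detection rate, and hypothetical helper names. For two classes the fixed vector has the closed form $v_1 = C_{12} / (C_{12} + C_{21})$.

```python
def apply_errors(C, V):
    """Compute W = C V for a column-stochastic error matrix C."""
    return [sum(C[i][j] * V[j] for j in range(len(V))) for i in range(len(C))]


def fixed_vector(C, V, n_iter=500):
    """Approximate V_f by repeatedly applying C (power iteration)."""
    for _ in range(n_iter):
        V = apply_errors(C, V)
    return V


# C[i][j] = P(y = i | x = j): miss rate 0.09 (column 1), false alarm 0.02 (column 2).
C = [[0.91, 0.02],
     [0.09, 0.98]]
V = [0.5, 0.5]

W = apply_errors(C, V)
V_f = fixed_vector(C, V)
print("W =", W)      # one application of C already moves V towards V_f
print("V_f =", V_f)  # close to [0.02 / 0.11, 0.09 / 0.11]
```

Here a true proportion of 0.5 is hypothesized as 0.465, which lies between the truth and the fixed point near 0.18.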

Given a set $Y$ of observed classifier output, we denote by $y(i)$
the classifier hypothesis for observation $i$, where $y(i) \in \{1, \dots, K\}$ for
a classifier with $K$ categories. Denote the number of observations in
$Y$ by $N_Y$. We abuse notation and let $Y$ further denote the vector of
hypothesized class counts, so $Y \sim \mathrm{Multi}(N_Y, W)$. Given $Y$, we
have $W^* = Y/N_Y$, the relative frequency estimator for $W$. Thus $W^*$
is a random variable, while $W^*(Y)$ is a fixed value. The expected
MSE of $W^*$ as an estimator of true proportions $V$ has the classic
decomposition:

$E[(W^* - V)^2] = \mathrm{var}(W^*) + (E[W^* - V])^2$  (2)

Consider the 2-class case. When the number of observations
$N_Y$ is large, then $\mathrm{var}(w_1^*)$ is small, squared bias dominates the MSE,
and the root mean squared error (RMSE) of $w_1^* \approx |E(w_1^*) - v_1|$.
For smaller $N_Y$, $\mathrm{var}(w_1^*)$ contributes to RMSE. Figure 1 shows an
example. Estimator $w_1^*$ suffers from uncontrolled bias due to the
shrinkage of $V$ towards $V_f$. The shrinkage depends on both $C$ and
unknown $V$, so RMSE cannot be predicted without an explicit model
for $C$. Compensating for the bias by estimating $C$ provides a more
predictable RMSE.

Fig. 1. RMSE of $w_1^*$ for 10% EER classifier, high and low variance.
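The variance-plus-squared-bias decomposition can be evaluated directly in the 2-class case. The sketch below assumes the hypothesized target count is binomial with success probability given by the first component of $CV$; the function name, the 10% error rates, and the sample sizes are illustrative assumptions rather than the settings behind Figure 1.

```python
import math


def rmse_hypothesized(v1, false_alarm, miss, n_obs):
    """RMSE of the hypothesized target proportion as an estimate of v1."""
    w1 = (1.0 - miss) * v1 + false_alarm * (1.0 - v1)  # expected hypothesized proportion
    variance = w1 * (1.0 - w1) / n_obs                 # binomial sampling variance
    bias = w1 - v1                                     # shrinkage towards the fixed point
    return math.sqrt(variance + bias ** 2)


# Compare a small and a large test set for a classifier with 10% equal error rate.
for n_obs in (100, 100000):
    curve = [rmse_hypothesized(v / 10, 0.10, 0.10, n_obs) for v in range(11)]
    print(n_obs, [round(r, 3) for r in curve])
```

With equal error rates the bias vanishes at a true proportion of 0.5 and grows towards either extreme, so the RMSE depends strongly on the unknown truth.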

3.3.  Hierarchical  Bayes  Model 

Estimation of error rates $C$ is typically done from some manually
labeled corpus $L$, with $L_{ij}$ the number of observations with true class
$j$ and hypothesized class $i$. The distribution of the parameter $V$
depends on the distributions of $W$ and $C$, which in turn depend on $L$
and $Y$. A hierarchical Bayes model can exploit priors not only on
the parameter of interest, but on the other parameters on which its
distribution depends.

We model true class and class-conditional output labels as multinomial
random variables. Flat priors allow us to model $P(W \mid Y)$ as
a Dirichlet and $P(C \mid L)$ as a hyper-Dirichlet distribution. The joint
distribution $P(C, W \mid L, Y)$ is more complicated in that the domain of
$W$ depends on $C$. Changing coordinates to $P(C, V \mid L, Y)$ eliminates
that issue, but data $Y$ provides information on $CV$ rather than
directly on $V$. Thus we construct the posterior $P(C, V \mid L, Y)$ via
random sampling.

3.4.  Metropolis-Hastings  Estimation  of  V 

Our goal is to estimate the distribution $P(V \mid L, Y)$, where $V$ is the
vector of true class proportions given data $L$ and $Y$. We have no
analytic solution for $P(V \mid L, Y)$, but do have:

$P(C, V \mid L, Y) \propto P_0(C, V)\, P(Y \mid W = CV)\, P(L \mid C)$

for prior $P_0(C, V)$ by Bayes' rule. We generate random samples of
$C$ and $V$ according to probabilities $P(C \mid V, L, Y)$ and $P(V \mid C, Y)$.
We recover $P(V \mid L, Y)$ by projecting onto the marginal distribution.

The M-H algorithm provides a Monte Carlo method for generating
samples that are provably convergent to a target distribution.
Denote some parameter space by $X$ and the (computable) probability
distribution by $q(x)$. M-H performs a random walk in $X$ via a
transition kernel $\pi(x, x')$. The transition kernel defines a Markov
chain, which under suitable conditions (i.e. ergodicity) is guaranteed
to converge in probability to the target distribution $q(x)$. See [3]
and [14] for more details.

We  generate  a  correlated  sample  of  size  T  as  follows: 

1. Set initial $C_0 = C(L)$, $V_0 = C_0^{-1} W^*(Y)$.

2. For $t$ in 1 to $T$:

(a) Select candidate $C'$ via independent transition kernel
$\pi_C = P(C' \mid L)$.

Define $W'_C = C' V_{t-1}$ and $W_{C,t-1} = C_{t-1} V_{t-1}$.
Thus $\alpha_C = P(W'_C \mid Y, L) / P(W_{C,t-1} \mid Y, L)$.

Accept $C_t = C'$ with probability $\min(\alpha_C, 1)$.

(b) Select candidate $V'$ via the transition $\pi_V$.

Define $W'_V = C_t V'$ and $W_{V,t-1} = C_t V_{t-1}$.
Thus $\alpha_V = P(W'_V \mid Y, L) / P(W_{V,t-1} \mid Y, L)$.

Accept $V_t = V'$ with probability $\min(\alpha_V, 1)$.

This random walk in $(C, V)$ will converge to $P(C, V \mid Y, L)$.
Sequence $V_t$ is guaranteed to converge to the marginal distribution of
interest, $P(V \mid Y, L)$. Given $K$ classes this is $O(K^2 T)$; the number
of classes that can be considered in practice is limited by the amount
of labeled data $L$ to estimate $C$ rather than algorithmic complexity.
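The sampling loop can be sketched for the two-class case as follows. This is an illustration under stated assumptions, not the authors' implementation: flat priors are assumed, so each column of $C$ has a Beta posterior given the labeled counts; the proposal for $V$ is taken to be a Gaussian random walk, a choice the text does not specify; and all function names, counts, and tuning values (step size, burn-in) are hypothetical.

```python
import math
import random


def w_target(v1, miss, fa):
    """First component of W = C V for two classes (class 1 = target)."""
    return (1.0 - miss) * v1 + fa * (1.0 - v1)


def log_lik(y1, y2, w1):
    """Log-likelihood (up to a constant) of hypothesized counts given w1."""
    if not 0.0 < w1 < 1.0:
        return float("-inf")
    return y1 * math.log(w1) + y2 * math.log(1.0 - w1)


def accept(rng, log_ratio):
    """Accept a candidate with probability min(ratio, 1)."""
    return math.log(1.0 - rng.random()) < log_ratio


def mh_target_proportion(L, Y, n_samples=20000, burn_in=2000, step=0.02, seed=0):
    """Draw samples of v1 from P(V | L, Y); L[i][j] counts hypothesized i, true j."""
    rng = random.Random(seed)
    (l11, l12), (l21, l22) = L
    y1, y2 = Y
    # Step 1: initialize C from the labeled counts and V by inverting W = C V.
    miss = l21 / (l11 + l21)
    fa = l12 / (l12 + l22)
    v1 = (y1 / (y1 + y2) - fa) / (1.0 - miss - fa)
    v1 = min(max(v1, 1e-6), 1.0 - 1e-6)
    samples = []
    for t in range(burn_in + n_samples):
        # (a) Independence proposal C' ~ P(C | L): one Beta draw per column of C.
        miss_new = rng.betavariate(l21 + 1, l11 + 1)
        fa_new = rng.betavariate(l12 + 1, l22 + 1)
        log_ratio = (log_lik(y1, y2, w_target(v1, miss_new, fa_new))
                     - log_lik(y1, y2, w_target(v1, miss, fa)))
        if accept(rng, log_ratio):
            miss, fa = miss_new, fa_new
        # (b) Symmetric random-walk proposal for V (assumed form of the kernel).
        v_new = v1 + rng.gauss(0.0, step)
        if 0.0 < v_new < 1.0:
            log_ratio = (log_lik(y1, y2, w_target(v_new, miss, fa))
                         - log_lik(y1, y2, w_target(v1, miss, fa)))
            if accept(rng, log_ratio):
                v1 = v_new
        if t >= burn_in:
            samples.append(v1)
    return samples


# Made-up counts: 5% error rates in training, 14% hypothesized targets in test.
L = ((950, 50), (50, 950))
Y = (280, 1720)
samples = mh_target_proportion(L, Y)
posterior_mean = sum(samples) / len(samples)
print(f"hypothesized: {Y[0] / sum(Y):.3f}  Bayes estimate: {posterior_mean:.3f}")
```

Because the candidate for $C$ is drawn from $P(C \mid L)$, the labeled-data terms cancel in the acceptance ratio and only the likelihood of $Y$ remains. The spread of the retained samples gives the posterior uncertainty on the target proportion.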

4.  EXPERIMENTAL  EVALUATION  ON  SPEAKER  ID 

4.1.  Introduction 

In this section, we experimentally evaluate the M-H algorithm on a
target/non-target speaker identification (SID) task derived from the
Switchboard corpus [15]. We construct this task by randomly
selecting 100 speakers (out of nearly 500) to constitute a modeled
target set, with the remaining open-set (unmodeled) speakers denoted
as non-targets.

We evaluate the RMSE as a function of true target proportion $v_1$,
comparing the RMSE curves for $W^*$ and $V^*$. Randomized training
($L$) and test ($Y$) sets are generated for values of $v_1$ in the
interval [0, 1]. Using the algorithm of the previous section, we estimate
$P(V \mid L, Y)$ for each data set and compute the RMSE as a function
of $v_1$. We then compare the RMSE of the hypothesized proportion
($w_1^*$) and of the Bayes estimated proportion ($v_1^*$), as functions of the
unknown true $v_1$.

4.2.  Data  and  Experimental  Set-up 

Andrews and Hernandez [16] provided SID scores for a subset of
Switchboard, using the algorithm from [17]. In particular, there are
4837 different voice cuts representing 483 different speakers. To
create a target set, 100 speakers were selected at random. To provide
a task with non-negligible error rate, only two trained models
were retained for each of the target speakers (i.e. 200 models). No
individual models were retained from the open-set (non-targets). We
defined a simple binary classifier with parameter $T$ as follows. For
each voice cut:

1. Find model scores $\{s_i\}$ for the target speakers (200 scores).

2. If $\max(s_i) > T$ then classify the voice cut as "Target", else
classify as "Non-target."

The resulting classification task has an equal error rate around 5%.
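The decision rule above amounts to thresholding the best target-model score. A minimal sketch follows; the function name, the threshold value, and the toy scores are illustrative assumptions, not values from the experiment.

```python
def classify_voice_cut(target_scores, threshold):
    """Label a voice cut from its scores against all target-speaker models."""
    return "Target" if max(target_scores) > threshold else "Non-target"


# Toy scores for two voice cuts against three target models.
cuts = ([0.1, 0.7, 0.2], [0.3, 0.4, 0.1])
labels = [classify_voice_cut(scores, threshold=0.5) for scores in cuts]
print(labels)
```

Raising the threshold trades false alarms for missed detections, which is how an operating point such as the equal error rate is selected.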

We estimate RMSE($v_1^*$) over various values of $v_1$ as follows.
Generate random partitions of the 5K voice cuts into training and
test sets. Denote the training sets by $L_i$, where the number of voice
cuts $N_L$ is constant (4111). Denote the test sets by $Y_i$, where the
number of voice cuts $N_Y$ is also constant (726).

Fig. 2. Switchboard Target/Non-Target Classification (2 Models per
Speaker, Equal Error Rate), Train: 4111, Test: 726. Operating point:
$C_{12} = C_{21} = 0.053$

The value of 'true target proportion' $v_{i1}$ is controlled by
constrained generation of the $Y_i$. Denote by $X_{i1}$ the number of true
target cuts in the data set $Y_i$, where the true proportion of target
speakers in that data set is given by $v_{i1} = X_{i1}/N_Y$. Denote by
$w_{i1}^* = Y_{i1}/N_Y$ the hypothesized target proportion in $Y_i$.

For each partition we estimate $P(V \mid Y, L)$ via M-H. Denote by
$N_{v_1}$ the number of partitions $(L_i, Y_i)$ with common $v_1$ (100 in our
experiments). The true $v_{i1}$ is known for each partition. This provides
an empirical measure for the RMSE of an estimator at fixed true
target proportion:
$\mathrm{RMSE}(v_1^* \mid v_1) = \left( \sum_{i:\, v_{i1} = v_1} (v_{i1}^* - v_{i1})^2 / N_{v_1} \right)^{1/2}$
and similarly for $\mathrm{RMSE}(w_1^* \mid v_1)$.
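The empirical RMSE at each fixed true proportion can be computed by grouping partitions on their known true value. This is a small sketch with a hypothetical function name and made-up inputs, intended only to show the grouping and averaging.

```python
import math
from collections import defaultdict


def rmse_by_true_proportion(true_props, estimates):
    """Map each distinct true proportion to the RMSE of its estimates."""
    squared_errors = defaultdict(list)
    for v_true, v_est in zip(true_props, estimates):
        squared_errors[v_true].append((v_est - v_true) ** 2)
    return {v: math.sqrt(sum(errs) / len(errs)) for v, errs in squared_errors.items()}


# Three partitions: two share a true proportion of 0.1, one has 0.2.
true_props = [0.1, 0.1, 0.2]
estimates = [0.12, 0.08, 0.2]
rmse = rmse_by_true_proportion(true_props, estimates)
print(rmse)
```

The same function applies to either estimator: pass the Bayes estimates or the hypothesized proportions as the second argument.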

We present RMSE($v_1^* \mid v_1$) and RMSE($w_1^* \mid v_1$), based upon 100
random partitions generated for every (approximate) percentile value
of $v_1$. Figure 2 shows the two curves at the EER operating point
$C_{12} = C_{21} = 0.05$. Observe that RMSE($w_1^*$) is quite unpredictable,
ranging between 0.01 and 0.05 depending on the true value of $v_1$.
RMSE($v_1^*$) is both significantly lower and more predictable. The
maximum of RMSE($v_1^*$) is a factor of 5 smaller than the maximum of
RMSE($w_1^*$). Furthermore, measuring the predictability of the errors
by range, 0.007 < RMSE($v_1^*$) < 0.017 at the equal error rate operating
point, while 0.006 < RMSE($w_1^*$) < 0.053. This gives a range of
0.01 versus 0.047, or a 75% relative reduction in the range of RMSE
for $v_1^*$. Figure 3 shows the estimator RMSE curves when the false alarm
rate ($C_{12}$) is 2% and the missed detection rate ($C_{21}$) is 9%.

4.3.  Value  Estimation  on  Streams 

One important problem is to identify which of several data streams
has the greatest proportion of some target class. If all streams
have the same classifier error, the best source of the target class is
the one with the highest observed $w_1^*$. In practice, however, streams
often differ, for example due to noise and channel effects. In these
cases hypothesized classes $W^*$ alone can lead to consistently poor
decisions.

We generate two data sets with different non-target distributions
via biased sampling. Rather than random allocation, we assign a
fixed proportion of those points on which the classifier fails to subsets
of the data. In particular, we divide the Switchboard data into
two halves $S_1$ and $S_2$, but allocate exactly one-third of classifier
errors to $S_1$. This creates overall error rates of 3.6% and 7.1% on $S_1$
and $S_2$ respectively.

Fig. 3. Switchboard Target/Non-Target Classification (2 Models per
Speaker, Min. Error Rate), Train: 4111, Test: 726. Operating point:
$C_{12} = 0.024$, $C_{21} = 0.088$

We partition each $S_i$ into equal pieces $L_i$ and $Y_i$. Classifier
performance $C$ is modeled independently on each $L_i$ to allow for
changes. We examine the results of estimation on 1000 partitions
$(L_1, L_2, Y_1, Y_2)$ for various fixed values of true target proportion $v_1$
on $Y_1$ and $Y_2$.

If we set target proportions $v_1 = 0.03$ in $Y_1$ and $v_1 = 0.01$
in $Y_2$, the mean value of $w_1^*$ is 0.064 in $Y_1$ and 0.080 in $Y_2$. The
mean value of $v_1^*$ is 0.031 in $Y_1$ and 0.015 in $Y_2$. Bayes estimation
decides $Y_1$ is the richer source of target voice cuts 87.7% of the time;
hypothesized classes select it only 0.6% of the time.

With target proportions $v_1 = 0.047$ in $Y_1$ and $v_1 = 0.01$ in $Y_2$,
the mean value of $w_1^*$ is 0.079 in each; the true target difference exactly
matches the difference in bias. Means for $v_1^*$ are 0.047 and 0.015
respectively. Bayes estimation selects the richer source 99.2% of the
time. Hypothesized classes have essentially random performance
(correct 47.2% of the time). Only as the difference in true class
proportions increases beyond 3.5% do hypothesized classes detect the
difference. We see that given rare target classes, a difference in false
alarm rate can overwhelm the difference in true target proportion.
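The reported means of the hypothesized proportions are consistent with a simple expected-value calculation. The sketch below assumes, for illustration only, that each stream's overall error rate (3.6% and 7.1%) applies equally to false alarms and missed detections; the function and variable names are hypothetical.

```python
def expected_hypothesized(v1, error_rate):
    """Expected hypothesized target proportion under a symmetric error rate."""
    return (1.0 - error_rate) * v1 + error_rate * (1.0 - v1)


# (true target proportion, assumed symmetric error rate) for each stream.
streams = {"Y1": (0.03, 0.036), "Y2": (0.01, 0.071)}
expected = {name: expected_hypothesized(v, e) for name, (v, e) in streams.items()}
print(expected)

# Raising the true proportion in Y1 to 0.047 makes the two streams look alike.
matched = expected_hypothesized(0.047, 0.036)
print(matched, expected["Y2"])
```

Under this assumption the stream with three times as many true targets still shows the lower hypothesized proportion, because the difference in error rates outweighs the difference in true proportions.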

5.  CONCLUSIONS 

This paper has addressed the problem of estimating class proportions
based on the output of an automated pattern classification system, for
example language, speaker or topic identification. We described a
hierarchical Bayes model for the true class distribution, which allows
construction of a Bayes estimator for the true class proportion.

This algorithm was experimentally evaluated on a binary SID
task derived from the Switchboard corpus. This experiment demonstrated
that the Bayes estimator of target proportion is far superior
to the hypothesized target proportion from the classifier. The maximum
RMSE was reduced by a factor of 5, and the range in RMSE
(as a measure of variability) was reduced by a factor of 4.